AITopics | text-to-image alignment

To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment.

arxiv preprint arxiv, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Israel (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

Neural Information Processing SystemsDec-27-2025

The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While revolutionary, as the complexity of given text input increases, the current state of art diffusion models may still fail in generating images that accurately convey the semantics of the given prompt. Furthermore, such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper, we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment.

iterative vqa feedback, name change, text-to-image alignment, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

GenEval: An object-focused framework for evaluating text-to-image alignment

Neural Information Processing SystemsDec-26-2025, 11:52:03 GMT

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a distribution-level measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative reasoning capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models.

geneval, name change, object-focused framework, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback

Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

Neural Information Processing SystemsOct-10-2025, 19:04:28 GMT

This project was completed during work at Google 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

diffusion model, reference image, reward function, (16 more...)

Neural Information Processing Systems

Country: North America > Canada (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology > Services (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback A Additional Results

Neural Information Processing SystemsOct-9-2025, 09:39:10 GMT

Each image is a given rating between 1 and 5 (where 1 represents that'image is semantically

input prompt, large language model, machine learning, (17 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Israel (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)

Add feedback

dfd0bd56e8a6f82d1619f5d093d5f9ca-Paper-Conference.pdf

Neural Information Processing SystemsOct-9-2025, 09:39:05 GMT

arxiv preprint arxiv, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Israel (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

Neural Information Processing SystemsJan-20-2025, 00:28:56 GMT

The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While revolutionary, as the complexity of given text input increases, the current state of art diffusion models may still fail in generating images that accurately convey the semantics of the given prompt. Furthermore, such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper, we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. The alignment of each assertion with generated images is then measured using a VQA model.

assertion, iterative vqa feedback, text-to-image alignment, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

GenEval: An object-focused framework for evaluating text-to-image alignment

Neural Information Processing SystemsJan-19-2025, 17:44:10 GMT

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a distribution-level measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color.

geneval, object-focused framework, text-to-image alignment, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback

Filters

Collaborating Authors

text-to-image alignment

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

df40e4147b55744ab782b09ccde054db-Paper-Conference.pdf

dfd0bd56e8a6f82d1619f5d093d5f9ca-Supplemental-Conference.pdf

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback Jaskirat Singh

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

GenEval: An object-focused framework for evaluating text-to-image alignment

Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback A Additional Results

dfd0bd56e8a6f82d1619f5d093d5f9ca-Paper-Conference.pdf

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

GenEval: An object-focused framework for evaluating text-to-image alignment